scGPT
Generative pre-trained models have achieved remarkable success in various domains such as natural language processing and computer vision. In particular, the combination of large-scale diverse datasets and pre-trained transformers has emerged as a promising approach for developing foundation models. Drawing parallels between language and cellular biology (texts comprise words; analogously, cells are characterized by genes), our study probes the applicability of foundation models to advance cellular biology and genetics research. Leveraging the rapidly growing body of single-cell sequencing data, we have pioneered the construction of a foundation model for single-cell biology, scGPT, based on a generative pre-trained transformer trained on a repository of over 33 million cells. Our findings illustrate that scGPT effectively distills critical biological insights concerning genes and cells. Through transfer learning, scGPT can be further adapted to achieve superior performance across diverse downstream applications, including cell-type annotation, multi-batch integration, multi-omic integration, genetic perturbation prediction, and gene network inference.
Haotian Cui, Chloe Wang, Hassaan Maan, Kuan Pang, Fengning Luo, Bo Wang
This is the official codebase for scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI.
UPDATE: We have released several new pretrained scGPT checkpoints. Please see the Pretrained scGPT checkpoints section for more details.
scGPT is available on PyPI. To install scGPT, run the following command:
$ pip install scgpt
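As a quick sanity check after installation, you can query the installed package version. This uses the standard importlib.metadata module rather than any scGPT-specific API; the version string shown will depend on the release you installed.

$ python -c "import importlib.metadata as m; print(m.version('scgpt'))"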
[Optional] We recommend using wandb for logging and visualization.
$ pip install wandb
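If you enable wandb, the sketch below shows the standard wandb logging workflow, independent of scGPT itself; the project name, config values, and metric names are placeholders for illustration only.

```python
import wandb

# Start a tracked run; "scgpt-finetune" is a placeholder project name.
run = wandb.init(project="scgpt-finetune", config={"lr": 1e-4, "epochs": 10})

# Log metrics during training; the values here are placeholders.
for epoch in range(run.config.epochs):
    wandb.log({"epoch": epoch, "train/loss": 0.0})

run.finish()
```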
For development, we use the Poetry package manager. To install Poetry, follow the instructions here.
$ git clone this-repo-url
$ cd scGPT
$ poetry install
Note: The flash-attn dependency usually requires a specific GPU and CUDA version. If you encounter any issues, please refer to the flash-attn repository for installation instructions. As of May 2023, we recommend using CUDA 11.7 and flash-attn<1.0.5, due to various issues reported when installing newer versions of flash-attn.
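For example, to pin flash-attn to the recommended version range under the CUDA 11.7 setup described above (the exact release that builds on your system may differ):

$ pip install "flash-attn<1.0.5"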
Here is the list of pretrained models. Please find the links for downloading the checkpoint folders below. We recommend using the whole-human model for most applications by default. If your fine-tuning dataset shares a similar cell-type context with the training data of an organ-specific model, that model can usually deliver competitive performance as well. A minimal loading sketch follows the table.
Model name | Description | Download |
---|---|---|
whole-human (recommended) | Pretrained on 33 million normal human cells. | link |
brain | Pretrained on 13.2 million brain cells. | link |
blood | Pretrained on 10.3 million blood and bone marrow cells. | link |
heart | Pretrained on 1.8 million heart cells. | link |
lung | Pretrained on 2.1 million lung cells. | link |
kidney | Pretrained on 814 thousand kidney cells. | link |
pan-cancer | Pretrained on 5.7 million cells of various cancer types. | link |
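Once a checkpoint folder is downloaded, you can inspect its contents with plain PyTorch before fine-tuning. The sketch below assumes the folder contains a best_model.pt state dict, a vocab.json gene vocabulary, and an args.json model configuration; the folder path is a placeholder, and the file names should be verified against your downloaded checkpoint.

```python
import json
from pathlib import Path

import torch

# Placeholder path to a downloaded checkpoint folder, e.g. the whole-human model.
ckpt_dir = Path("save/scGPT_human")

# Assumed file layout (best_model.pt, vocab.json, args.json); verify for your download.
state_dict = torch.load(ckpt_dir / "best_model.pt", map_location="cpu")
vocab = json.loads((ckpt_dir / "vocab.json").read_text())
model_args = json.loads((ckpt_dir / "args.json").read_text())

print(f"{len(state_dict)} parameter tensors, {len(vocab)} vocabulary entries")
print("model config keys:", sorted(model_args))
```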
Please see our example code in examples/finetune_integration.py. By default, the script assumes the scGPT checkpoint folder is stored in the examples/save directory.
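For instance, after placing a checkpoint folder under examples/save, the script can be launched directly; this assumes its built-in defaults, so check the script for the datasets and arguments it expects.

$ python examples/finetune_integration.py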
We greatly welcome contributions to scGPT. Please submit a pull request if you have any ideas or bug fixes. We also encourage you to report any issues you encounter while using scGPT.
We sincerely thank the authors of the following open-source projects:
@article{cui2023scGPT,
title={scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI},
author={Cui, Haotian and Wang, Chloe and Maan, Hassaan and Pang, Kuan and Luo, Fengning and Wang, Bo},
journal={bioRxiv},
year={2023},
publisher={Cold Spring Harbor Laboratory}
}